Strawberry Detection Through Deep Learning¶

HS 495 Project 3

Jerry H. Yu

Introduction¶

Computer Vision (CV) tools are receiving increasing attention in agricultural fields after demonstrating their ability to increase throughput and cost effectiveness over traditional phenotyping methods $^{1}$. Computer vision, a subfield of Artificial Intelligence (AI), focuses on deriving meaningful information from visual data like images and videos $^{2,3}$. One application of CV to agricultural use is fruit detection, where a model learns to find and count ripe fruit like strawberries from video or photos of plants in the field $^{4}$. These methods can accelerate plant breeding by automating the process of counting strawberries to assess plant productivity, or guide robots to pick ripe strawberries $^{4}$.

However, applying CV techniques in agricultural settings presents unique challenges. Two common hurdles are the limited availability of labeled training data and the computational demands of running large models on less powerful devices in the field $^{3}$. For fruit detection, these challenges are particularly acute, requiring sufficient labeled images of strawberries at varying ripeness levels, under diverse lighting conditions, and accounting for natural occlusion from leaves and stems $^{5,6}$.

The state-of-the-art (SOTA) in computer vision currently relies on two primary classes of deep learning algorithms: Convolutional Neural Networks (CNNs) and Visual Transformers. CNNs, like the popular YOLO, ResNet, and MobileNet families, extract relevant features through a series of learned convolutions (filter operations similar to blurring) $^{7,8}$. Throughout the 2010s, CNNs were the dominant approach, and they remain widely used when speed and model size are critical. More recently, Visual Transformers have emerged, processing images by dividing them into tokens and embedding them as vectors, allowing them to leverage transformer-based architectures and incorporate broader contextual information, often at the cost of increased run (inference) time and model size $^{9,10}$. Models of this type include Vision Transformers (ViT) from Google, Florence2 from Microsoft, and Qwen2.5-VL from Alibaba $^{9}$.
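As a rough illustration of the tokenization step described above (a minimal sketch, not any particular model's implementation), an image can be cut into fixed-size patches and each patch flattened into a vector, which a learned linear projection would then embed:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch tokens.

    Returns an (N, patch*patch*C) array, one row per patch,
    mirroring how a ViT turns an image into a token sequence.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)       # group pixels by patch
             .reshape(-1, patch * patch * c)  # flatten each patch
    )
    return tokens

# A 640x640 RGB image with 16x16 patches yields 40*40 = 1600 tokens.
img = np.zeros((640, 640, 3), dtype=np.float32)
tokens = patchify(img, 16)
print(tokens.shape)  # (1600, 768)
```

The quadratic attention cost over these 1600 tokens is one reason transformer-based detectors tend to be slower than CNNs at the same input resolution.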

In this class, we were tasked with training a deep learning model to detect ripe strawberries. Previous research has demonstrated the effectiveness of YOLO models in strawberry detection, achieving high accuracy, recall, and F1 scores $^{5,6}$. However, these studies often predate the release of newer YOLO architectures and are trained on much larger labeled datasets. Therefore, this project will evaluate and compare the performance of recent YOLO models (YOLOv11 and YOLOE) alongside a transformer-based Qwen model on a very small dataset $^{9–11}$. Our primary goals are to assess the models' performance with limited training data, along with their computational cost and speed, providing insights into the most efficient and effective approach for automated strawberry phenotyping.

Image Preprocessing and Augmentation¶

Preprocessing¶

  • For labeling, Shailesh Raj Acharya, Steve Ameridge, and I collaborated on annotating the images.
  • The original dataset (V1) contained 63 images.
  • Later, I added 8 additional images. These were mostly null images, or images without any ripe strawberries, to try to obtain better precision. That is, adding images with no ripe strawberries should make the model less likely to identify something that is not a ripe strawberry as a ripe strawberry. This version is called V4.
  • Data was split with a standard 70-20-10 approach: 70% of images assigned to training, 20% to validation, and 10% to testing.
  • Each training image was duplicated to produce three versions of the same image.
  • Examples of the types of images used in the dataset are shown below.
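The 70-20-10 split above can be sketched as follows (Roboflow performed the actual split; the file names here are hypothetical placeholders):

```python
import random

def split_dataset(files, train=0.7, val=0.2, seed=0):
    """Shuffle file names deterministically and split 70/20/10
    into train/validation/test lists."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# 71 images total: 63 original plus the 8 added null images.
names = [f"img_{i:03d}.jpg" for i in range(71)]
tr, va, te = split_dataset(names)
print(len(tr), len(va), len(te))  # 49 14 8
```

Fixing the seed keeps the split reproducible, which matters when comparing models trained on different dataset versions.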
In [15]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, "Data/Project3---Strawberry-1/train/images/39_crop1_png.rf.9ed6f32f0000a8290b62ea41e167cfff.jpg"))
img2 = Image.open(os.path.join(repo.working_tree_dir, "Data/Project3---Strawberry-4/train/images/Fragaria_berried_treasure_pink_everbearing_strawberry_gc_FRABP_03_jpg.rf.e2b63a188d49dfdb3b564a6ddf3bdc73.jpg")
)

# # Resize (optional)
# img1 = img1.resize((300, 300))
# img2 = img2.resize((300, 300))

# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))

# Display image
display(combined)

# Add caption
display(Markdown("**Figure 1.** Left: An example of an image from the original dataset. Right: An example of a null image I added. Notice the color of the flowers to train the model to account for features other than color."))

Figure 1. Left: An example of an image from the original dataset. Right: An example of a null image I added. Notice the color of the flowers to train the model to account for features other than color.

Augmentation¶

  • Originally, I chose a combination of image augmentations I thought would be beneficial, tending toward the high end of what Roboflow recommends. These were in addition to the automatic augmentations applied by the Ultralytics train function.
  • For all models, I added noise to up to 1.05% of pixels.
  • In later versions, I left the images unaugmented in Roboflow (with the exception of noise), relying on the automatic augmentations only and tuning augmentation parameters alongside other hyperparameters.
  • Table 1 describes all augmentations applied across the models that were fine-tuned.
| Parameter | YOLO11 Finetuned Default | YOLO11 Finetuned Best of 10 | YOLO11 Finetuned Best of 274 |
| --- | --- | --- | --- |
| degrees (rotation) | **0.0 + 29** | **0.0** | **0.0** |
| hsv_h (hue) | **0.015** | **0.01458** | **0.01287** |
| hsv_s (saturation) | **0.7 + 0.21** | **0.09784** | **0.09813** |
| hsv_v (value) | **0.4** | **0.1** | **0.1** |
| translate | **0.1** | **0.1** | **0.06559** |
| scale | 0.5 | 0.5 | 0.5 |
| shear | 0.0 | 0.0 | 0.0 |
| perspective | 0.0 | 0.0 | 0.0 |
| flipud | 0.0 | 0.0 | 0.0 |
| fliplr | 0.5 | 0.5 | 0.5 |
| bgr | 0.0 | 0.0 | 0.0 |
| mosaic | 1.0 | 1.0 | 1.0 |
| mixup | 0.0 | 0.0 | 0.0 |
| copy_paste | 0.0 | 0.0 | 0.0 |
| copy_paste_mode | flip | flip | flip |
| auto_augment | randaugment | randaugment | randaugment |
| erasing | 0.4 | 0.4 | 0.4 |
| crop_fraction | 1.0 | 1.0 | 1.0 |

Table 1: All augmentation settings for the models that were fine-tuned. Bold indicates differences between models, while the + sign indicates augmentations from Roboflow, which were only applied in the first model. For further information on augmentation hyperparameters, please refer to the Roboflow docs here $^{12}$.
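To illustrate what the hue/saturation/value rows in Table 1 mean in practice, here is a minimal single-pixel sketch of HSV jitter. This is illustrative only; Ultralytics' actual implementation applies random gains to whole images via lookup tables.

```python
import colorsys
import random

def hsv_jitter(rgb, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, seed=0):
    """Jitter one RGB pixel in HSV space by random gains, roughly
    mimicking the hsv_h / hsv_s / hsv_v augmentation parameters."""
    rng = random.Random(seed)
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + rng.uniform(-hsv_h, hsv_h)) % 1.0                      # hue shifts wrap around
    s = min(max(s * (1 + rng.uniform(-hsv_s, hsv_s)), 0.0), 1.0)    # saturation gain, clipped
    v = min(max(v * (1 + rng.uniform(-hsv_v, hsv_v)), 0.0), 1.0)    # brightness gain, clipped
    r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r2, g2, b2))

print(hsv_jitter((220, 30, 40)))  # a jittered shade of strawberry red
```

Keeping hsv_h small is sensible for this task: hue is exactly the feature that separates ripe from unripe strawberries, so large hue shifts would corrupt the label-relevant signal.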

Model Selection¶

I chose three models for my analysis. A description for each model is given below:

  • YOLOv11 is the latest CNN-based model from Ultralytics, optimized for both speed and accuracy in object detection tasks¹¹. Compared to the YOLO versions shown in class, YOLO11 employs a series of architectural tweaks that make parts of the model more efficient by requiring fewer parameters, enabling faster inference without significantly affecting performance¹³.

    • I chose the large version to increase the model's base performance. All YOLO models are relatively easy to train with my hardware, so picking the one with the best results was preferable¹¹,¹².
  • YOLOE builds on YOLO11 by incorporating two sub-networks that allow YOLOE to detect objects from custom descriptions without additional training¹⁰. These components are as follows:

    • SAVPE (Semantic-Activated Visual Prompt Encoder) enhances visual salience by linking image regions to semantic features from text prompts¹⁴.

    • RepRTA (Re-parameterizable Region-Text Alignment) transforms text descriptions (e.g., “red ripe strawberry”) into high-dimensional embeddings, aligned with CLIP-like vectors, enabling the model to detect custom or unseen objects without retraining¹⁴.

  • Qwen-2.5 VL 3b Instruct is a multimodal vision language model that combines a customized Vision Transformer (ViT) with the QWen 2.5 LLM⁹. Qwen's ViT uses enhanced positional embeddings and multimodal fusion layers to align image features with text queries. It's pre-trained on large image-text datasets and then on vision-language tasks such as region detection¹⁴.

    • As it is a multimodal large language model (MLLM), this model is much larger than the other two¹⁰,¹¹,¹⁵.

I feel that these models give us a good overview of different SOTA CV approaches, representing the trade-off between contextual understanding and processing speed.

Results¶

In [3]:
# Checking GPU
!nvidia-smi
# Check torch version and compatability
print(torch.version.cuda)
print(torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())

Test Yolo11 Pretrained¶

Note: We used an image of a cat with mangosteens to demo each model's abilities without any training (zero-shot).

In [8]:
from ultralytics import YOLO
# checks if ultralytics is up to date and if the GPU is available
ultralytics.checks()
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
Setup complete  (32 CPUs, 95.7 GB RAM, 994.4/1862.2 GB disk)
In [9]:
# Load Pretrained YOLO11
yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt")) 
pic = Image.open(os.path.join(repo.working_tree_dir,"Data/mangosteencat.jpeg"))
result = yolo11.predict(pic,conf=0.25)[0]
detections = sv.Detections.from_ultralytics(result)
0: 640x640 1 cat, 1 cup, 1 bowl, 1 apple, 10.9ms
Speed: 9.9ms preprocess, 10.9ms inference, 128.8ms postprocess per image at shape (1, 3, 640, 640)
In [10]:
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)

annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=detections)
annotated_image = label_annotator.annotate(annotated_image, detections=detections)

sv.plot_image(annotated_image, size=(10, 10))
  • You can see that YOLO11 is able to detect quite a few objects correctly.
  • However, the model's vocabulary is limited, leading it to identify the mangosteens as apples.
In [11]:
# Now Test on Strawberries
pic = Image.open(os.path.join(repo.working_tree_dir,"Data/Project3---Strawberry-1/test/images/27_crop2_png.rf.89569ab2704bf6aa15acab6150fe12b8.jpg"))
result = yolo11.predict(pic,conf=0.25)[0]
detections = sv.Detections.from_ultralytics(result)
0: 640x640 1 apple, 12.5ms
Speed: 1.7ms preprocess, 12.5ms inference, 2.3ms postprocess per image at shape (1, 3, 640, 640)
In [12]:
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)

annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=detections)
annotated_image = label_annotator.annotate(annotated_image, detections=detections)

sv.plot_image(annotated_image, size=(10, 10))
  • The pretrained model is very inaccurate on our data; it seems to think that all fruits are apples.

Start Finetune¶

  • Now that we've seen YOLO11's zero-shot effectiveness on our data, let's fine-tune it.
In [ ]:
# Import Roboflow Project Version 1
os.chdir(repo.working_tree_dir + "\\Data")
rf = Roboflow(api_key ="api")
rf.workspace().project("project3-strawberry").version(1).download("yolov11")
loading Roboflow workspace...
loading Roboflow project...
Out[ ]:
<roboflow.core.dataset.Dataset at 0x21bb7f29250>
In [39]:
# Train the model under default settings
os.chdir(repo.working_tree_dir)
v1yolo11 = yolo11.train(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"), 
                        model = os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"),
                       epochs=100, imgsz=640,name="V1default",val=False)
  • took 2 minutes and 9 seconds

Error Analysis¶

Now Let's Validate our Model!

In [22]:
os.chdir(repo.working_tree_dir + "\\Validation")  # double backslash: "\V" is an invalid escape sequence
v1yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/finetune/V1default.pt")) 
metricsdefault = v1yolo11.val(data = os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"),
                             plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLO11n summary (fused): 100 layers, 2,582,347 parameters, 0 gradients, 6.3 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Project3---Strawberry-1\valid\labels.cache... 12 images, 0 backgrounds, 0 corrupt: 100%|██████████| 12/12 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:02<00:00,  2.23s/it]
                   all         12        113      0.715      0.735      0.718      0.335
Speed: 0.1ms preprocess, 1.2ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs\detect\val2
  • For default settings with relatively few epochs, the mAP50 (0.718) and mAP50-95 (0.335) are not bad.
  • We take these values as a baseline.
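For reference, mAP50 counts a prediction as correct when its box overlaps a ground-truth box with IoU ≥ 0.5, while mAP50-95 averages average precision over IoU thresholds from 0.5 to 0.95, so it penalizes loose boxes much more. A minimal IoU computation (with hypothetical boxes, not our model's outputs):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted 20 px off a 100x100 ground-truth box still
# passes the mAP50 threshold but fails the stricter ones:
print(round(iou((0, 0, 100, 100), (20, 0, 120, 100)), 3))  # 0.667
```

This is why our mAP50-95 sits far below mAP50: slightly loose boxes (e.g. around occluded strawberries) count at IoU 0.5 but not at 0.75 or above.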
In [ ]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/val/confusion_matrix.png"))
img2 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/val/F1_curve.png")
)
img3 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/val/PR_curve.png")
)
# # Resize (optional)
img1 = img1.resize((2000,1500))
# img2 = img2.resize((300, 300))

# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width + img3.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
combined.paste(img3, (img1.width + img2.width, 0))

# Display image
display(combined)

# Add caption
display(Markdown("**Figure 2.** The confusion matrix, F1 curve, and PR curve of the default fine-tuned model."))

Figure 2. The confusion matrix, F1 curve, and PR curve of the default fine-tuned model.

  • From these graphs, we see that false positives outnumber false negatives.
  • F1 also attains a stable value quickly before a slower drop-off. This is mirrored in the PR curve, where precision plateaus (as recall increases) earlier than recall plateaus (as precision increases).
    • This indicates that the majority of false positives receive lower confidences, whereas many true positives receive mid-range confidences.
    • This in turn could indicate that many true positives are harder to detect.
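The relationship between the confidence threshold and precision/recall described above can be made concrete with a toy set of detections (the confidences and counts below are hypothetical, not our model's actual outputs):

```python
def pr_at_threshold(dets, n_gt, thresh):
    """Precision and recall from (confidence, is_true_positive) pairs
    at a given confidence cutoff."""
    kept = [tp for conf, tp in dets if conf >= thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / n_gt
    return precision, recall

# High-confidence true positives, low-confidence false positives,
# and one hard true positive stuck at low confidence:
dets = [(0.9, 1), (0.85, 1), (0.6, 1), (0.55, 1), (0.3, 0), (0.25, 0), (0.2, 1)]
for t in (0.5, 0.1):
    p, r = pr_at_threshold(dets, n_gt=6, thresh=t)
    print(f"thresh={t}: precision={p:.2f}, recall={r:.2f}")
```

Sweeping the threshold from high to low traces out exactly the PR curve shape discussed above: precision stays high until the low-confidence false positives are admitted.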

Let's check this

In [35]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/val/val_batch0_labels.jpg"))
img2 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/val/val_batch0_pred.jpg")
)

# # Resize (optional)
# img1 = img1.resize((2000,1500))
# img2 = img2.resize((300, 300))

# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))

# Display image
display(combined)

# Add caption
display(Markdown("**Figure 3.** Validation plots. Left: labels. Right: predictions."))

Figure 3. Validation plots. Left: labels. Right: predictions.

  • From a simple visual inspection of the labels and predictions, we can see examples of why the model is confused.
    • For one, the correct labels annotate varying degrees of ripeness as ripe. For instance, the second image from the top, farthest to the right, has a prominent pink strawberry. This strawberry seems no more or less ripe than the pink strawberry in the top-left image, but one was labeled ripe and the other was not.
    • Partially occluded strawberries are also an issue: in some instances, the model bounds tightly on the red portions of covered strawberries, while the human labels might include the estimated shape of the ripe strawberry, including areas covered by green leaves.

Overall, this error analysis casts doubt on whether the model will be able to achieve much higher scores, given the small size and inconsistent labeling of the training data.

In [40]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "runs/detect/V1default2/results.png"))

# Display image
display(img1)

# Add caption
display(Markdown("**Figure 4.** Training metrics of the default fine-tuned model."))

Figure 4. Training metrics of the default fine-tuned model.

  • Given that loss did not seem to plateau after 100 epochs, I decided to try again and train my model with more epochs.

More Epochs Training Run¶

In [23]:
# Train the model under default settings with more epochs
os.chdir(repo.working_tree_dir)
v1yolo11 = yolo11.train(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"), 
                        model = os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"),
                       epochs=1000, imgsz=640,name="V1more_epochs",val=False)
  • Took 16 minutes and 37 seconds
In [26]:
os.chdir(repo.working_tree_dir + "\\Validation")  # double backslash: "\V" is an invalid escape sequence
V1yolo11more_epochs = YOLO(os.path.join(repo.working_tree_dir,"Models/finetune/V1yolo11more_epochs.pt")) 
metricsdefault = V1yolo11more_epochs.val(data = os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-1\\data.yaml"),
                             plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLO11n summary (fused): 100 layers, 2,582,347 parameters, 0 gradients, 6.3 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Project3---Strawberry-1\valid\labels.cache... 12 images, 0 backgrounds, 0 corrupt: 100%|██████████| 12/12 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:02<00:00,  2.25s/it]
                   all         12        113      0.754      0.673      0.705      0.345
Speed: 0.1ms preprocess, 1.2ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs\detect\val2
  • As you can see, training the model with more epochs did not make much of a difference to actual performance. Though the model became more confident in its predictions, it still misclassified images at a similar rate. Additionally, while precision increased, recall decreased by considerably more.

Tune Yolo11 with inbuilt YOLO Tune Function¶

In my next step, I decided to use the YOLO11 inbuilt tune function to try to optimize my augmentation and training hyperparameters. However, to challenge the model, I also decided to add some null images.

Changes

  • Added null images (~10%) to deal with false positives
  • Removed most automatic Roboflow augmentations, allowing us to tune augmentation usage with the tune function
In [6]:
# Load in Modified Dataset 4
os.chdir(os.getcwd() + "\\Data")
rf = Roboflow(api_key ="api")
rf.workspace().project("project3-strawberry").version(4).download("yolov11")
loading Roboflow workspace...
loading Roboflow project...
Out[6]:
<roboflow.core.dataset.Dataset at 0x1fe79079890>
In [ ]:
from ultralytics import YOLO
yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"))
srch_space = {
        "lr0":(1e-5, 1e-1),         # Initial learning rate
        "lrf":(0.01, 1.0),          # Final learning rate factor
        "momentum":(0.6, 0.98),     # Momentum
        "weight_decay":(0.0, 0.001),# Weight decay
        "warmup_epochs":(0.0, 5.0), # Warmup epochs
        "box":(0.02, 0.2),          # Box loss weight
        "cls":(0.2, 4.0),           # Class loss weight
        "hsv_h":(0.0, 0.1),         # Hue augmentation
        "hsv_s":(0.0, 0.1),         # Saturation augmentation (custom)
        "hsv_v":(0.0, 0.1),         # Brightness augmentation (custom)
        "translate":(0.0, 0.9)      # Translation range
    }
model_tune = yolo11.tune(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-4\\data.yaml"), 
                            space=srch_space,
                            epochs=100,
                            iterations=10,
                            plots=False,
                            save=False,
                            val=False,
                            batch=32
                            )
In [41]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "runs/detect/tune/tune_fitness.png"))

# Display image
display(img1)

# Add caption
display(Markdown("**Figure 5.** Fitness of models during the first hyperparameter tuning run (10 iterations)."))

Figure 5. Fitness of models during the first hyperparameter tuning run (10 iterations).

  • Took 47 minutes and 52 seconds

So hyperparameter tuning is expensive! YOLO's base genetic algorithm is also inefficient, with no inbuilt early stopping and limited parallelizability. The optimal hyperparameters were identified in the second iteration, suggesting insufficient search effort for an 11-dimensional parameter space. The performance of the best model is evaluated below.
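To make the early-stopping point concrete, here is a generic random-search tuner with a patience-based stopping rule, sketched on a toy objective. This is not part of the Ultralytics API; the toy fitness function and its peak location are hypothetical stand-ins for a real training run's fitness.

```python
import random

def random_search(objective, space, iters=50, patience=10, seed=0):
    """Random-search tuner that stops early when the best score has
    not improved for `patience` consecutive iterations."""
    rng = random.Random(seed)
    best_score, best_cfg, stale = float("-inf"), None, 0
    for _ in range(iters):
        cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_score, best_cfg, stale = score, cfg, 0
        else:
            stale += 1
            if stale >= patience:  # no recent improvement: stop searching
                break
    return best_cfg, best_score

# Toy fitness peaking at lr0 = 0.01, momentum = 0.9 (hypothetical).
toy = lambda c: -((c["lr0"] - 0.01) ** 2 + (c["momentum"] - 0.9) ** 2)
cfg, score = random_search(toy, {"lr0": (1e-5, 1e-1), "momentum": (0.6, 0.98)})
print(cfg, round(score, 4))
```

With a rule like this, a run whose best configuration appears in iteration 2 would terminate after a handful of further iterations instead of consuming the full budget.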

In [47]:
os.chdir(repo.working_tree_dir + "\\Validation")  # double backslash: "\V" is an invalid escape sequence

yolo11bo10 = YOLO(os.path.join(repo.working_tree_dir,"Models/finetune/yolo11besto10.pt")) 
metricsdefault = yolo11bo10.val(data = os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-4\\data.yaml"),
                             plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLO11l summary (fused): 190 layers, 25,280,083 parameters, 0 gradients, 86.6 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Project3---Strawberry-4\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:02<00:00,  2.31s/it]
                   all         14        113      0.809      0.712      0.733      0.384
Speed: 0.3ms preprocess, 9.7ms inference, 0.0ms loss, 0.7ms postprocess per image
Results saved to runs\detect\val3
  • This is a different dataset, so metrics are not directly comparable.
  • However, in general P,R,mAP50 and mAP50-95 are all higher.

At this point, I decided I would try a different approach to tuning by shortening the epochs each model was trained on to 30 and increasing the number of iterations I ran the genetic algorithm to 300. I achieved 274 iterations before I decided to stop prematurely, because at that point tuning had taken some 12 hours and class was about to start.

In [ ]:
from ultralytics import YOLO
yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt"))
srch_space = {
        "lr0":(1e-5, 1e-1),         # Initial learning rate
        "lrf":(0.01, 1.0),          # Final learning rate factor
        "momentum":(0.6, 0.98),     # Momentum
        "weight_decay":(0.0, 0.001),# Weight decay
        "warmup_epochs":(0.0, 5.0), # Warmup epochs
        "box":(0.02, 0.2),          # Box loss weight
        "cls":(0.2, 4.0),           # Class loss weight
        "hsv_h":(0.0, 0.1),         # Hue augmentation
        "hsv_s":(0.0, 0.1),         # Saturation augmentation (custom)
        "hsv_v":(0.0, 0.1),         # Brightness augmentation (custom)
        "translate":(0.0, 0.9)      # Translation range
    }
model_tune = yolo11.tune(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-4\\data.yaml"), 
                            space=srch_space,
                            epochs=30,
                            iterations=300,
                            plots=False,
                            save=False,
                            val=False,
                            batch=32
                            )
In [ ]:
# Train on Best Hyperparameters
os.chdir(repo.working_tree_dir)

from ultralytics import YOLO
v3yolo11 = YOLO(os.path.join(repo.working_tree_dir,"Models/pretrain/yolo11l.pt")) 
v3yolo11.train(data=os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-4\\data.yaml"), 
                epochs=100, 
                imgsz=640,
                name="V3bestof274",
                lr0=0.00794,
                lrf=0.01,
                momentum = 0.98,
                weight_decay = 0.00045,
                warmup_epochs = 2.6986,
                optimizer = "AdamW",
                box = 0.2,
                cls = 0.37206,
                hsv_h= 0.01287,
                hsv_s= 0.09813,
                hsv_v= 0.1,
                translate= 0.06559)
In [46]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "runs/detect/tune3/tune_fitness.png"))

# Display image
display(img1)

# Add caption
display(Markdown("**Figure 6.** Fitness of models during the second hyperparameter tuning run (274 iterations)."))

Figure 6. Fitness of models during the second hyperparameter tuning run (274 iterations).

In [48]:
os.chdir(repo.working_tree_dir + "\\Validation")  # double backslash: "\V" is an invalid escape sequence

yolo11bo274 = YOLO(os.path.join(repo.working_tree_dir,"Models/finetune/yolo11besto274.pt")) 
metricsdefault = yolo11bo274.val(data = os.path.join(repo.working_tree_dir,"Data\\Project3---Strawberry-4\\data.yaml"),
                             plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLO11l summary (fused): 190 layers, 25,280,083 parameters, 0 gradients, 86.6 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Project3---Strawberry-4\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:02<00:00,  2.36s/it]
                   all         14        113      0.755      0.681      0.717      0.367
Speed: 0.1ms preprocess, 7.3ms inference, 0.0ms loss, 1.1ms postprocess per image
Results saved to runs\detect\val3
  • Once again, hyperparameter tuning seems random.
  • However, surprisingly, the best-of-10 model beat the best-of-274 model.
    • While this could be random, it also suggests that optimal hyperparameters found when tuning with fewer epochs do not generalize well to a model trained with more epochs.
  • The best iteration was reached at iteration 98.
  • I can only conclude that
    1. Our search space was too high-dimensional (too many parameters), which made optimization difficult, especially given the time constraints.
    2. The inherent ambiguity in our labels, combined with the sparsity of our data, significantly hindered our ability to identify substantially improved fitness from superior hyperparameter configurations.
  • Both reasons are well supported in this project.

Now let's take a closer look at validation for the best of 10 model (first tuning run result).

In [50]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/y11v10s4/confusion_matrix.png"))
img2 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/y11v10s4/F1_curve.png")
)
img3 = Image.open(os.path.join(repo.working_tree_dir, 
                               "Validation/runs/detect/y11v10s4/PR_curve.png")
)
# # Resize (optional)
img1 = img1.resize((2000,1500))
# img2 = img2.resize((300, 300))

# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width + img3.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
combined.paste(img3, (img1.width + img2.width, 0))

# Display image
display(combined)

# Add caption
display(Markdown("**Figure 7.** The confusion matrix, F1 curve, and PR curve of the best `YOLO11` model with optimized hyperparameters."))

Figure 7. The confusion matrix, F1 curve, and PR curve of the best YOLO11 model with optimized hyperparameters.

  • In general, this model seems better at maintaining high precision as recall increases: precision does not drop suddenly until around 0.75 recall.
    • So adding more null images and tuning hyperparameters was somewhat successful, decreasing the FPs in the confusion matrix by around 10 images while only increasing FNs by 5.

Zero Shot YOLOe¶

  • Now we load and test the YOLOe model. More information can be found here.
In [124]:
# Load Pretrained YOLO11e
from ultralytics import YOLOE
yolo11edemo1 = YOLOE(os.path.join(repo.working_tree_dir,"Models/pretrain/yoloe-11l-seg.pt")).cuda()
yolo11edemo2 = YOLOE(os.path.join(repo.working_tree_dir,"Models/pretrain/yoloe-11l-seg.pt")).cuda()
yolo11edemo3 = YOLOE(os.path.join(repo.working_tree_dir,"Models/pretrain/yoloe-11l-seg.pt")).cuda()
In [125]:
# Load Custom Labels
os.chdir(repo.working_tree_dir + "\\Models\\pretrain")
cnames1 = ["hat","cat eye","mangosteen"]
cnames2 = ["straw hat","black fruit mangosteen","brown cat eye"]
cnames3 = ["black fruit hat","black fruit","green cat eye"]
# Set custom labels
yolo11edemo1.set_classes(cnames1,yolo11edemo1.get_text_pe(cnames1))
yolo11edemo2.set_classes(cnames2,yolo11edemo2.get_text_pe(cnames2))
yolo11edemo3.set_classes(cnames3,yolo11edemo3.get_text_pe(cnames3))
In [126]:
# Load the Demo Image
pic = Image.open(os.path.join(repo.working_tree_dir,"Data\\mangosteencat.jpeg"))
demo1 = yolo11edemo1.predict(pic,conf=0.1)[0]
demo1 = sv.Detections.from_ultralytics(demo1)

demo2 = yolo11edemo2.predict(pic,conf=0.1)[0]
demo2 = sv.Detections.from_ultralytics(demo2)

demo3 = yolo11edemo3.predict(pic,conf=0.1)[0]
demo3 = sv.Detections.from_ultralytics(demo3)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLOe-11l-seg summary (fused): 227 layers, 35,117,862 parameters, 2,254,374 gradients, 144.1 GFLOPs

0: 640x640 2 hats, 15.3ms
Speed: 3.7ms preprocess, 15.3ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLOe-11l-seg summary (fused): 227 layers, 35,117,862 parameters, 2,254,374 gradients, 144.1 GFLOPs

0: 640x640 1 straw hat, 1 black fruit mangosteen, 2 brown cat eyes, 13.9ms
Speed: 2.2ms preprocess, 13.9ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 640)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLOe-11l-seg summary (fused): 227 layers, 35,117,862 parameters, 2,254,374 gradients, 144.1 GFLOPs

0: 640x640 1 black fruit hat, 3 black fruits, 1 green cat eye, 14.4ms
Speed: 2.3ms preprocess, 14.4ms inference, 1.9ms postprocess per image at shape (1, 3, 640, 640)
In [75]:
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)

# P1
annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=demo1)
annotated_image = label_annotator.annotate(annotated_image, detections=demo1)

# P2
annotated_image2 = pic.copy()
annotated_image2 = box_annotator.annotate(annotated_image2, detections=demo2)
annotated_image2 = label_annotator.annotate(annotated_image2, detections=demo2)

# P3
annotated_image3 = pic.copy()
annotated_image3 = box_annotator.annotate(annotated_image3, detections=demo3)
annotated_image3 = label_annotator.annotate(annotated_image3, detections=demo3)

# Combine side by side
combined = Image.new("RGB", (annotated_image.width + annotated_image2.width + annotated_image3.width, annotated_image.height))
combined.paste(annotated_image, (0, 0))
combined.paste(annotated_image2, (annotated_image.width, 0))
combined.paste(annotated_image3, (annotated_image.width + annotated_image2.width, 0))

# Display image
display(combined)

# sv.plot_image(annotated_image, size=(10, 10))
  • This demo displays some tests of the zero-shot ability of YOLOE-11.
  • The tests demonstrate the importance of understanding the vocabulary of the RepRTA module.
    • Concepts (like mangosteen) may be completely absent from the vocabulary, hindering identification.
    • Other concepts (like cat's eye) may require more descriptive prompts, though the added description need not even be correct: adding "green" to "cat's eye" in the third image gets the cat's eyes detected, even though they are brown.
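The vocabulary dependence above follows from how open-vocabulary heads work: a region embedding is scored against text-prompt embeddings, and the best match wins. A toy sketch with made-up 3-D "embeddings" (purely illustrative, not YOLOE's actual representation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical embeddings: a region is classified as whichever prompt
# embedding it is most similar to, so a missing or vague vocabulary
# entry directly limits what can be detected.
region = (0.9, 0.1, 0.2)
prompts = {"hat": (0.88, 0.15, 0.2), "cat": (0.1, 0.9, 0.3)}
best = max(prompts, key=lambda k: cosine(region, prompts[k]))
print(best)  # → hat
```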

Now let's try with strawberries!

In [127]:
# Set custom class prompts
from ultralytics import YOLOE
os.chdir(os.path.join(repo.working_tree_dir, "Models", "pretrain"))
cnames1 = ["ripe strawberry"]
cnames2 = ["red ripe strawberry"]
# Attach the text prompt embeddings for each label set
yolo11edemo1.set_classes(cnames1, yolo11edemo1.get_text_pe(cnames1))
yolo11edemo2.set_classes(cnames2, yolo11edemo2.get_text_pe(cnames2))
In [115]:
# Run the demo predictions
demo1 = yolo11edemo1.predict(pic, conf=0.1)[0]
demo1 = sv.Detections.from_ultralytics(demo1)

demo2 = yolo11edemo2.predict(pic, conf=0.1)[0]
demo2 = sv.Detections.from_ultralytics(demo2)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)

0: 640x640 5 ripe strawberrys, 13.6ms
Speed: 1.9ms preprocess, 13.6ms inference, 6.1ms postprocess per image at shape (1, 3, 640, 640)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)

0: 640x640 8 red ripe strawberrys, 14.0ms
Speed: 1.5ms preprocess, 14.0ms inference, 1.4ms postprocess per image at shape (1, 3, 640, 640)
In [116]:
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator(text_color=sv.Color.BLACK)

# P1
annotated_image = pic.copy()
annotated_image = box_annotator.annotate(annotated_image, detections=demo1)
annotated_image = label_annotator.annotate(annotated_image, detections=demo1)

# P2
annotated_image2 = pic.copy()
annotated_image2 = box_annotator.annotate(annotated_image2, detections=demo2)
annotated_image2 = label_annotator.annotate(annotated_image2, detections=demo2)

# Combine side by side
combined = Image.new("RGB", (annotated_image.width + annotated_image2.width, annotated_image.height))
combined.paste(annotated_image, (0, 0))
combined.paste(annotated_image2, (annotated_image.width, 0))

# Display image
display(combined)

# sv.plot_image(annotated_image, size=(10, 10))
  • As the two panels show, adding the descriptor "red" appears to increase recall significantly (5 vs. 8 detections).

Now let's validate!

In [ ]:
# Load pretrained YOLOE-11 (segmentation checkpoint)
yolo11e = YOLOE(os.path.join(repo.working_tree_dir,"Models/pretrain/yoloe-11l-seg.pt")).cuda()
In [144]:
os.chdir(os.path.join(repo.working_tree_dir, "Models", "pretrain"))
# Build the detection-only variant, then load weights from the segmentation checkpoint
yolo11edet = YOLOE("yoloe-11l.yaml")
state = torch.load(os.path.join(repo.working_tree_dir, "Models/pretrain/yoloe-11l-seg.pt"))
yolo11edet.load(state["model"])
In [146]:
os.chdir(os.path.join(repo.working_tree_dir, "Validation"))
metrics3 = yolo11edet.val(data=os.path.join(repo.working_tree_dir, "Data\\Strawberrybasevalnored\\data.yaml"),
                          load_vp=True,
                          conf=0.5,
                          plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
Validate using the visual prompt.
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Strawberrybasevalnored\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
Get visual prompt embeddings from samples: 100%|██████████| 1/1 [00:00<00:00,  6.06it/s]
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLOe-11l summary (fused): 215 layers, 32,812,582 parameters, 2,492,550 gradients, 88.8 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Strawberrybasevalnored\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:03<00:00,  3.17s/it]
                   all         14        113      0.635      0.699      0.691       0.45
Speed: 0.7ms preprocess, 8.9ms inference, 0.0ms loss, 1.4ms postprocess per image
Results saved to runs\detect\val4
In [176]:
metrics3 = yolo11edet.val(data=os.path.join(repo.working_tree_dir, "Data\\Strawberrybaseval\\data.yaml"),
                          load_vp=True,
                          conf=0.5,
                          plots=True)
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
Validate using the visual prompt.
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Strawberrybaseval\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
Get visual prompt embeddings from samples: 100%|██████████| 1/1 [00:00<00:00,  7.59it/s]
Ultralytics 8.3.105  Python-3.11.11 torch-2.8.0.dev20250407+cu128 CUDA:0 (NVIDIA GeForce RTX 3090, 24576MiB)
YOLOe-11l summary (fused): 215 layers, 32,812,582 parameters, 2,492,550 gradients, 88.8 GFLOPs
val: Scanning C:\Users\Public\Documents\Class_Projects\Strobbery\Data\Strawberrybaseval\valid\labels.cache... 14 images, 2 backgrounds, 0 corrupt: 100%|██████████| 14/14 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:04<00:00,  4.16s/it]
                   all         14        113      0.635      0.699      0.691       0.45
Speed: 0.1ms preprocess, 6.6ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs\detect\val4
  • Surprisingly, the metrics for "ripe strawberry" and "red ripe strawberry" turn out to be exactly the same, likely because both runs validate with visual prompts (load_vp=True), so the text labels have no effect on the result.
  • Otherwise, metrics across the board are lower than those of the fine-tuned models, but the mAP50-95 is by far the highest of all models.
    • This indicates that while the model detects fewer strawberries at the 0.5 IoU threshold, the boxes it does detect fit the ground truth closely, yielding higher scores at stricter thresholds.
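The mAP50 vs. mAP50-95 trade-off can be made concrete with a toy IoU check (a standalone sketch with made-up boxes, not part of the notebook pipeline): a loose box passes the 0.5 threshold but fails stricter ones, while a tight box is credited almost everywhere.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

gt    = [100, 100, 200, 200]   # hypothetical ground-truth strawberry
tight = [102,  98, 203, 201]   # snug prediction
loose = [ 90,  90, 215, 215]   # sloppy prediction

print(round(iou(gt, tight), 2))  # → 0.92: a hit at nearly every mAP50-95 threshold
print(round(iou(gt, loose), 2))  # → 0.64: a hit at IoU 0.5 but a miss at 0.75 and above
```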
In [ ]:
# Load images
img1 = Image.open(os.path.join(repo.working_tree_dir,
                               "Validation/runs/detect/valredstraw/F1_curve.png"))
img2 = Image.open(os.path.join(repo.working_tree_dir,
                               "Validation/runs/detect/valredstraw/P_curve.png"))
img3 = Image.open(os.path.join(repo.working_tree_dir,
                               "Validation/runs/detect/valredstraw/PR_curve.png"))
img4 = Image.open(os.path.join(repo.working_tree_dir,
                               "Validation/runs/detect/valredstraw/R_curve.png"))

# Combine side by side
combined = Image.new("RGB", (img1.width + img2.width + img3.width + img4.width, max(img1.height, img2.height)))
combined.paste(img1, (0, 0))
combined.paste(img2, (img1.width, 0))
combined.paste(img3, (img1.width + img2.width, 0))
combined.paste(img4, (img1.width + img2.width + img3.width, 0))

# Display image
display(combined)

# Add caption
display(Markdown("**Figure 6.** The F1, P, R, and PR curves of the YOLOE model."))

Figure 6. The F1, P, R, and PR curves of the YOLOE model.

  • An interesting difference between this model's curves and the others' is the general smoothness of the P, R, and PR curves.
  • There is much less stair-stepping, perhaps indicating a worse fit than with the fine-tuned models.
  • Recall appears significantly worse than precision for this model.

Zero-Shot with Qwen2.5-VL 3B¶

Now I wanted to try a Vision Language Model (VLM), specifically Qwen2.5-VL. My GPU could only fit the 3B-parameter version, but even that is overkill for plain bounding-box generation.

In [ ]:
# Run with Maestro
os.chdir(repo.working_tree_dir)
from maestro.trainer.models.qwen_2_5_vl.checkpoints import load_model, OptimizationStrategy

MODEL_ID_OR_PATH = "Qwen/Qwen2.5-VL-3B-Instruct"
MIN_PIXELS = 512 * 28 * 28
MAX_PIXELS = 2048 * 28 * 28

processor, model = load_model(
    model_id_or_path=MODEL_ID_OR_PATH,
    optimization_strategy=OptimizationStrategy.NONE,  # pass an enum member, not the class itself
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS
)
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
In [ ]:
# Create the Base Function for Running Inference
from typing import Optional, Tuple, Union

from maestro.trainer.models.qwen_2_5_vl.inference import predict_with_inputs
from maestro.trainer.models.qwen_2_5_vl.loaders import format_conversation
from maestro.trainer.common.utils.device import parse_device_spec
from qwen_vl_utils import process_vision_info

def run_qwen_2_5_vl_inference(
    model,
    processor,
    image: Image.Image,
    prompt: str,
    system_message: Optional[str] = None,
    device: str = "auto",
    max_new_tokens: int = 1024,
) -> Tuple[str, Tuple[int, int]]:
    
    device = parse_device_spec(device)
    conversation = format_conversation(image=image, prefix=prompt, system_message=system_message)
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(conversation)

    inputs = processor(
        text=text,
        images=image_inputs,
        return_tensors="pt",
    )

    input_h = inputs['image_grid_thw'][0][1] * 14
    input_w = inputs['image_grid_thw'][0][2] * 14

    response = predict_with_inputs(
        **inputs,
        model=model,
        processor=processor,
        device=device,
        max_new_tokens=max_new_tokens
    )[0]

    return response, (input_w, input_h)
In [ ]:
# Run Example
IMAGE_PATH = os.path.join(repo.working_tree_dir,"Data\\mangosteencat.jpeg")
SYSTEM_MESSAGE = None
PROMPT = "Describe the image in detail. Include the number of objects, " \
"their colors, and any other relevant information. Do this in English and then in Chinese."

image = Image.open(IMAGE_PATH)
resolution_wh = image.size
response, input_wh = run_qwen_2_5_vl_inference(
    model=model,
    processor=processor,
    image=image,
    prompt=PROMPT,
    system_message=SYSTEM_MESSAGE
)

print(response)
The image features a cat wearing a straw hat adorned with a mangosteen fruit on top. The cat is sitting on a woven mat surrounded by various mangosteens arranged in a circular pattern. In the background, there is a glass of red liquid, likely mangosteen juice, placed on a woven basket. The setting appears to be an orchard with trees in the distance, bathed in sunlight filtering through the leaves.

### English Description:
- **Cat**: Wearing a straw hat with a mangosteen fruit on top.
- **Mangosteens**: Arranged in a circular pattern around the cat.
- **Glass of Red Liquid**: Positioned on a woven basket.
- **Background**: An orchard with trees, sunlight filtering through the leaves.

### Chinese Description:
- **猫(cat)**: 戴着一顶草帽,草帽上有一个木瓜。
- **木瓜(mangosteen)**: 周围摆放成一个圆形图案。
- **玻璃杯中的红色液体(glass of red liquid)**: 放在编织篮子里。
- **背景**:一个果园,树木在远处,阳光透过树叶洒下。
  • As expected of a language model, it is far more versatile than the CNN-based models, since it can generate free-form text.
  • Its English vocabulary is also better than YOLOE's, though in the Chinese description it mislabels the mangosteen as a papaya (木瓜).
In [7]:
IMAGE_PATH = os.path.join(repo.working_tree_dir,"Data\\mangosteencat.jpeg")
SYSTEM_MESSAGE = None
PROMPT = "Outline the position of each of the cat's eyes, its hat, and all individual mangosteens and output all the coordinates in JSON format. "

image = Image.open(IMAGE_PATH)
resolution_wh = image.size
response, input_wh = run_qwen_2_5_vl_inference(
    model=model,
    processor=processor,
    image=image,
    prompt=PROMPT,
    system_message=SYSTEM_MESSAGE
)

print(response)
```json
[
	{"bbox_2d": [450, 518, 503, 556], "label": "cat's eyes"},
	{"bbox_2d": [570, 509, 622, 550], "label": "cat's eyes"},
	{"bbox_2d": [267, 291, 802, 578], "label": "hat"},
	{"bbox_2d": [0, 665, 1016, 997], "label": "mangosteens"}
]
```
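Under the hood, `sv.Detections.from_vlm` (used in the next cell) has to rescale these pixel coordinates from the model's input resolution back to the original image. A rough sketch of that step, using one box from the output above with made-up sizes (the exact scaling supervision performs may differ):

```python
import json

raw = '[{"bbox_2d": [267, 291, 802, 578], "label": "hat"}]'
input_wh = (1024, 1024)       # hypothetical model input size (from image_grid_thw)
resolution_wh = (2048, 2048)  # hypothetical original image size

# Scale each corner from input resolution to original resolution
sx = resolution_wh[0] / input_wh[0]
sy = resolution_wh[1] / input_wh[1]
boxes = [[b["bbox_2d"][0] * sx, b["bbox_2d"][1] * sy,
          b["bbox_2d"][2] * sx, b["bbox_2d"][3] * sy]
         for b in json.loads(raw)]
print(boxes)  # → [[534.0, 582.0, 1604.0, 1156.0]]
```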
In [9]:
import supervision as sv
detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_2_5_VL,
    result=response,
    input_wh=input_wh,
    resolution_wh=resolution_wh
)

box_annotator = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX)
label_annotator = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX)

annotated_image = image.copy()
annotated_image = box_annotator.annotate(scene=annotated_image, detections=detections)
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections)
annotated_image
Out[9]:
  • Inference took about 5 seconds.
  • Qwen2.5-VL is also fine-tuned on bounding-box generation, so it outputs bounding boxes readily.
  • Just from visual inspection, precision seems great!
  • All models miss the mangosteen on top of the cat's head, though this model noticed it in the text description.
  • Now let's try with strawberries!
In [115]:
IMAGE_PATH = os.path.join(repo.working_tree_dir,"Data\\Strawberrybaseval\\valid\\images\\6_crop1_png.rf.9c9cee8dd7e22d359733f58642df7bf2.jpg")
SYSTEM_MESSAGE = "You are an expert strawberry picker. Every day you walk in fields of strawberries brushing past the leaves to pick them as they are just ripe and deep red. You are paid 10,000 dollars for every ripe strawberry you pick at perfect ripeness. You love finding hidden red strawberries among green leaves"

PROMPT = "Outline the position of each red ripe strawberry and output all the coordinates in JSON format. " \
"Include blurry red strawberries in the background or partially covered by leaves but not half ripe berries that are still partially green. DETECT ALL RED STRAWBERRY AREAS."

image = Image.open(IMAGE_PATH)
resolution_wh = image.size
response, input_wh = run_qwen_2_5_vl_inference(
    model=model,
    processor=processor,
    image=image,
    prompt=PROMPT,
    system_message=SYSTEM_MESSAGE
)
In [116]:
import supervision as sv
detections = sv.Detections.from_vlm(
    vlm=sv.VLM.QWEN_2_5_VL,
    result=response,
    input_wh=input_wh,
    resolution_wh=resolution_wh
)

box_annotator = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX)
label_annotator = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX)

annotated_image = image.copy()
annotated_image = box_annotator.annotate(scene=annotated_image, detections=detections)
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections)
annotated_image
Out[116]:

Now let's validate!

In [153]:
# Read YOLO Validation Dataset into Supervision Form
ground_truth_det = sv.DetectionDataset.from_yolo(
     images_directory_path=f"{repo.working_tree_dir}/Data/Strawberrybaseval/valid/images",
     annotations_directory_path=f"{repo.working_tree_dir}/Data/Strawberrybaseval/valid/labels",
     data_yaml_path=f"{repo.working_tree_dir}/Data/Strawberrybaseval/data.yaml"
)
In [ ]:
# Helper: convert an OpenCV image to a PIL image
def convertimage(image) -> Image.Image:
    """
    Convert an OpenCV (BGR numpy array) image to a PIL image; pass PIL images through.
    """
    if isinstance(image, np.ndarray):
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return Image.fromarray(image)
    else:
        return image
In [ ]:
# Wrapper function to enable image validation
def strawberry_detect(image):
    image = convertimage(image)
    resolution_wh = image.size
    # Run the VLM
    response, input_wh = run_qwen_2_5_vl_inference(
        model=model,
        processor=processor,
        image=image,
        prompt=PROMPT,
        system_message=SYSTEM_MESSAGE
    )
    # Create Detections object
    detections = sv.Detections.from_vlm(
        vlm=sv.VLM.QWEN_2_5_VL,
        result=response,
        input_wh=input_wh,
        resolution_wh=resolution_wh
    )
    detections.class_id = np.array([0] * len(detections.xyxy))
    detections.confidence = np.array([1.0] * len(detections.xyxy))
    return detections
In [148]:
# Set Prompts
IMAGE_PATH = os.path.join(repo.working_tree_dir,"Data\\Strawberrybaseval\\valid\\images\\28_crop1_png.rf.f56aea5e8772091fab239515e0b1f808.jpg")
SYSTEM_MESSAGE = "You are an expert strawberry picker. Every day you walk in fields of strawberries brushing past the leaves to pick them as they are just ripe and deep red. You are paid 10,000 dollars for every ripe strawberry you pick at perfect ripeness. You love finding hidden red strawberries among green leaves"

PROMPT = "Outline the position of each red ripe strawberry and output all the coordinates in JSON format. " \
"Include blurry red strawberries in the background or partially covered by leaves but not half ripe berries that are still partially green. DETECT ALL RED STRAWBERRY AREAS."
In [154]:
# Get Confusion Matrix
cm = sv.ConfusionMatrix.benchmark(ground_truth_det,
                                  strawberry_detect,
                                  iou_threshold=0.5,
                                  conf_threshold=0.5)
In [ ]:
plot = cm.plot()
# Add caption
display(Markdown("**Figure 7.** Confusion Matrix of the Qwen 2.5 VL Model"))

Figure 7. Confusion Matrix of the Qwen 2.5 VL Model

In [156]:
map_result = sv.MeanAveragePrecision.benchmark(   # avoid shadowing the built-in map()
    dataset=ground_truth_det,
    callback=strawberry_detect
)
In [157]:
print(map_result.map50)
print(map_result.map50_95)
0.5419218199474471
0.3038236218116333
  • Contrary to the demo's appearance, the object detection ability of Qwen 2.5-VL is quite poor.
  • It has the worst mAP50 and mAP50-95 of all the models, and it's not even close.
  • Its recall appears better than its precision here.
    • However, this balance is largely prompt-dependent as well.
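The precision/recall balance reads directly off confusion-matrix counts; a quick sketch with purely illustrative numbers (not the actual matrix values):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts only: many spurious boxes but few missed berries
# gives recall > precision.
p, r = precision_recall(tp=60, fp=40, fn=15)
print(round(p, 2), round(r, 2))  # → 0.6 0.8
```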

Discussion¶

Preprocessing¶

A fundamental limit of this exploration was the scarcity and inconsistency of labeled data. As has been demonstrated repeatedly, model performance scales directly with the number of training examples $^{16,17}$. Thus the performance of our fine-tuned models was inferior to results reported in similar studies, whose models were trained on larger datasets $^{5,6}$.

Figure 8: An example of an edge positive (A), a blurry background ripe strawberry, and an edge negative (B), an almost-ripe strawberry

An additional impediment to model performance was inconsistency in data labeling, especially the lack of a unified policy among labelers for edge cases. Specifically, labelers disagreed on whether partially obscured or almost-ripe strawberries should be classified as ripe. This inconsistency disproportionately affected recall, likely due to the higher prevalence of covered or background strawberries in the validation dataset compared to nearly ripe strawberries. To address this in future work, establishing and enforcing rigorous labeling protocols with clear guidelines for edge cases will be paramount, though not sufficient if the training data remains small.
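One concrete piece of such a protocol could be an automated agreement check between labelers before training. A minimal sketch, with hypothetical boxes and a 0.5 IoU matching threshold:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def agreement(boxes_a, boxes_b, thr=0.5):
    """Fraction of annotator A's boxes matched by some box from annotator B."""
    if not boxes_a:
        return 1.0
    matched = sum(any(iou(a, b) >= thr for b in boxes_b) for a in boxes_a)
    return matched / len(boxes_a)

ann_a = [[10, 10, 50, 50], [60, 60, 90, 90]]
ann_b = [[12, 11, 49, 52]]          # second berry labeled by A only
print(agreement(ann_a, ann_b))      # → 0.5
```

Images whose agreement falls below a chosen bar could be flagged for joint review, turning the "edge case" disagreements into explicit protocol decisions.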

While zero-shot learning with an open vocabulary was also evaluated, such approaches were not a panacea in this experiment. Although some models like YOLOE displayed impressive out-of-the-box abilities comparable to fine-tuned models, their precision and accuracy did not exceed fine-tuned performance. Zero-shot models had similar trouble with edge cases, which suggests that misclassification there stems more from the inherent difficulty of the task than from confusing labeled data.

Beyond improved coordination, strategies to expand the labeled dataset efficiently could be explored. Semi-supervised learning, using the current models (or models trained on more consistently labeled data) as teachers, is one option. Synthetic data generation with image-generation models such as GANs or diffusion models is another.
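As a sketch of the semi-supervised idea, a teacher's detections could be filtered by confidence to mint new pseudo-labels; everything here (the `teacher` callable, the threshold, the data shapes) is hypothetical:

```python
# Sketch of a pseudo-labeling step for semi-supervised dataset expansion:
# keep only high-confidence teacher detections as new "labels".
def pseudo_labels(images, teacher, conf_thr=0.8):
    labeled = {}
    for name, img in images.items():
        kept = [box for box, conf in teacher(img) if conf >= conf_thr]
        if kept:                      # skip images with no confident boxes
            labeled[name] = kept
    return labeled

# Dummy teacher for illustration only
fake = lambda img: [([0, 0, 10, 10], 0.9), ([5, 5, 8, 8], 0.3)]
print(pseudo_labels({"a.jpg": None}, fake))  # → {'a.jpg': [[0, 0, 10, 10]]}
```

A real pipeline would plug a trained YOLO model in as `teacher`, write the kept boxes out in YOLO label format, and retrain on the expanded set.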

Model Choice¶

| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| Yolo11 Default | 0.715 | **0.735** | 0.718 | 0.335 |
| Yolo11 More Epochs | 0.754 | 0.673 | 0.705 | 0.345 |
| Yolo11 Best of 10 | **0.809** | 0.712 | **0.733** | 0.384 |
| Yolo11 Best of 274 | 0.755 | 0.681 | 0.717 | 0.367 |
| YoloE | 0.635 | 0.699 | 0.691 | **0.45** |
| Qwen 2.5 VL | 0.663 | 0.54 | 0.452 | 0.284 |

Table 2: Overall metric comparisons for the six models produced. Bolded values represent the best metric in each category. Note that the first two YOLO11 models were trained on a slightly different dataset than the other four.

Our results strongly support the conclusion that fine-tuned models outperformed zero-shot models in most cases. While zero-shot YOLOE achieved decent results, particularly mAP at the 50-95 threshold, it consistently fell below even models fine-tuned with default parameters. With small architectures like YOLO, the cost of fine-tuning is also small, with the most significant bottleneck being labeling rather than compute.

Our results suggest that object detection is likely not the optimal use case for Vision Language Models (VLMs) such as Qwen 2.5 VL. While demonstrating a relatively high level of contextual understanding when asked to provide textual descriptions of images, Qwen exhibited significantly lower object detection performance compared to Convolutional Neural Network (CNN)-based models like YOLO, even zero-shot models such as YOLOe. Furthermore, Qwen proved reluctant to generate bounding boxes, and those it did create were often inaccurate. Qwen would likely perform better in conversational contexts with more continuous human feedback, where greater flexibility is desired than in simple object detection.

Hyperparameter tuning, utilizing the standard YOLO tuning function and a genetic algorithm, yielded mixed results. While tuned hyperparameters generally led to improved performance, the highest recall was observed in the default model. Moreover, fitness remained essentially flat across both rounds of hyperparameter tuning, suggesting the algorithm was unable to orient itself within the search space to identify superior models. Notably, tuning with fewer epochs did not prove beneficial, as the model trained on hyperparameters found from a search of 10 iterations of 100 epochs outperformed the model trained on hyperparameters from a search of 274 iterations of 30 epochs. Reducing the number of augmentations appeared to improve model performance, though that might lead to overfitting.

An analogy can be drawn between hyperparameter optimization in fine-tuning and prompt selection in zero-shot analysis. Both YOLOE and Qwen require a prompt to guide their operation, and comparing bounding boxes across prompts showed that prompt variations can influence model outcomes. While this did not significantly affect YOLOE in our evaluation, it demonstrably impacted Qwen. Therefore, even though not technically a hyperparameter, prompt engineering would be a crucial component of any deployment workflow for zero-shot models.
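Such a prompt sweep could be scripted much like a hyperparameter search. In the sketch below, the `evaluate` callback is a stand-in for a real scoring call (such as wrapping `sv.MeanAveragePrecision.benchmark` over the validation set); the prompts and scores are made up:

```python
# Minimal prompt-sweep loop: score each candidate prompt and keep the best.
def best_prompt(prompts, evaluate):
    scores = {p: evaluate(p) for p in prompts}
    return max(scores, key=scores.get), scores

prompts = ["ripe strawberry", "red ripe strawberry", "deep red strawberry"]
# Stub evaluator with illustrative mAP50 values, not measured results
stub = lambda p: {"ripe strawberry": 0.45,
                  "red ripe strawberry": 0.54,
                  "deep red strawberry": 0.51}[p]
winner, scores = best_prompt(prompts, stub)
print(winner)  # → red ripe strawberry
```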

Of all potential architectural or model improvements, YOLOE demonstrates the most promise for further study. Given its relatively small size, fine-tuning would be computationally feasible, and potential studies could compare model performance after tuning the entire model versus tuning only the SAVPE or RepRTA submodules. Indeed, the YOLO documentation specifically mentions tuning only these modules as a strategy for overcoming limited labeled training data.

Domain Adaptation¶

Validating the YOLO models on the dataset they were not trained on demonstrates the value of including more null images. As seen in Tables 3 and 4, YOLO models trained on the original dataset performed worse when evaluated on the dataset with null images than vice versa. This indicates that incorporating even a small number of null images improves model generalizability.

| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| Yolo11 Default | 0.709 | 0.735 | 0.714 | 0.333 |
| Yolo11 More Epochs | 0.747 | 0.673 | 0.696 | 0.338 |

Table 3: Metrics for models trained on Version 1 of the training dataset (without null images), evaluated on Version 4.

| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| Yolo11 Best of 10 | 0.809 | 0.712 | 0.734 | 0.385 |
| Yolo11 Best of 274 | 0.762 | 0.681 | 0.719 | 0.368 |

Table 4: Metrics for models trained on Version 4 of the training dataset (with null images), evaluated on Version 1.

| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLOe | 0.651 | 0.699 | 0.695 | 0.453 |
| Qwen 2.5 VL | 0.71 | 0.5 | 0.565 | 0.327 |

Table 5: Metrics for the zero-shot models evaluated on Version 1 of the training dataset.

Furthermore, it is clear that Version 1 of the dataset represented an easier task, as evidenced by the zero-shot models' improved performance on Dataset 1 compared to Dataset 4. This suggests that training on harder, more varied datasets generally improves performance on easier ones, a finding well supported in the existing literature.

In conclusion, this research reinforces the value of traditional supervised learning with high-quality, representative data. While advancements in zero-shot learning are promising, in this experiment they fell short of the performance achievable by fine-tuning SOTA object detection models like YOLO11. Future work could explore fine-tuning YOLOE, semi-supervised learning, and synthetic data to further increase the accuracy of deep learning for automated strawberry phenotyping.

Citations¶

  1. Computer Vision and Machine Learning in Agriculture. (Springer Singapore, Singapore, 2021). doi:10.1007/978-981-33-6424-0.
  2. What is Computer Vision? | IBM. https://www.ibm.com/think/topics/computer-vision.
  3. Phenomics: How Next-Generation Phenotyping Is Revolutionizing Plant Breeding. (Springer International Publishing, Cham, 2015). doi:10.1007/978-3-319-13677-6.
  4. Wang, X. Phenomics for Strawberry Breeding. (2025).
  5. Wang, Y. et al. DSE-YOLO: Detail semantics enhancement YOLO for multi-stage strawberry detection. Computers and Electronics in Agriculture 198, 107057 (2022).
  6. Wang, C. et al. Strawberry Detection and Ripeness Classification Using YOLOv8+ Model and Image Processing Method. Agriculture 14, 751 (2024).
  7. onnx/models. Open Neural Network Exchange (2025).
  8. Ultralytics. Home. https://docs.ultralytics.com/.
  9. Bai, J. et al. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Preprint at https://doi.org/10.48550/arXiv.2308.12966 (2023).
  10. Ultralytics. YOLOE (Real-Time Seeing Anything). https://docs.ultralytics.com/models/yoloe.
  11. Ultralytics. YOLO11 🚀 NEW. https://docs.ultralytics.com/models/yolo11.
  12. Ultralytics. Train. https://docs.ultralytics.com/modes/train.
  13. Khanam, R. & Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. Preprint at https://doi.org/10.48550/arXiv.2410.17725 (2024).
  14. Wang, A. et al. YOLOE: Real-Time Seeing Anything. Preprint at https://doi.org/10.48550/arXiv.2503.07465 (2025).
  15. Qwen/Qwen-VL · Hugging Face. https://huggingface.co/Qwen/Qwen-VL.
  16. Alabdulmohsin, I. & Neyshabur, B. Revisiting Neural Scaling Laws in Language and Vision.
  17. Prato, G., Guiroy, S., Caballero, E., Rish, I. & Chandar, S. Scaling Laws for the Few-Shot Adaptation of Pre-trained Image Classifiers. Preprint